The prediction of influenza-like illness using national influenza surveillance data and Baidu query data

Background Seasonal influenza and other respiratory tract infections are serious public health problems that need to be further addressed and investigated. Internet search data are recognized as a valuable source for forecasting influenza or other respiratory tract infection epidemics. However, the selection of internet search data and the application of forecasting methods are important for improving forecasting accuracy. The aim of the present study was to forecast influenza epidemics based on the long short-term memory neural network (LSTM) method, Baidu search index data, and the influenza-like-illness (ILI) rate.
Methods The official weekly ILI% data for northern and southern mainland China from 2018 to 2021 were obtained from the Chinese Influenza Center. Search indices related to influenza infection over the corresponding time period were obtained from the Baidu Index. Pearson correlation analysis was performed to explore the association between influenza-related search queries and the ILI% of southern and northern mainland China. The LSTM model was used to forecast the influenza epidemic within the same week and at lags of 1–4 weeks. Model performance was assessed with evaluation metrics, including the mean square error (MSE), root mean square error (RMSE) and mean absolute error (MAE).
Results In total, 24 search queries in northern mainland China and 7 search queries in southern mainland China were found to be correlated with the ILI% and were used to construct the LSTM model, which included the same week and lags of 1–4 weeks. The LSTM model showed that ILI% + mask with one lag week and ILI% + influenza name were good prediction modules for northern and southern mainland China, reducing the RMSE by 16.75% and 4.20%, respectively, compared with the model based on the ILI% alone.
Conclusions The results illuminate the feasibility of using an internet search index as a complementary data source for influenza forecasting and the efficiency of using the LSTM model to forecast influenza epidemics.


Background
Seasonal influenza and other respiratory tract infections remain serious public health problems. The WHO estimates that annual influenza epidemics result in approximately 1 billion infections, 3–5 million severe cases and 300,000–650,000 deaths globally [1,2]. A previous study estimated that 88,100 influenza-associated excess respiratory deaths occurred in China from 2010 to 2015 [3]. The National Health Commission of China reported 1,145,278 and 668,246 influenza cases in 2020 and 2021, with incidence rates of 81.5816 and 47.4008 per 100,000, respectively (http://www.nhc.gov.cn/jkj/s3578/202103/f1a448b7df7d4760976fea6d55834966.shtml, http://www.nhc.gov.cn/jkj/s3578/202204/4fd88a291d914abf8f7a91f6333567e1.shtml). These conditions pose significant social and economic burdens in China. Owing to the emergence of coronavirus disease 2019 (COVID-19), influenza activity declined rapidly in 2020. However, the intensity of the influenza epidemic has gradually increased since the spring of 2021. Thus, it is essential to establish an influenza surveillance system to monitor influenza epidemic trends. In China, the National Notifiable Infectious Disease Reporting System (NNIDRS) is used for the surveillance of communicable diseases, and the hospital-based influenza surveillance system of the Chinese Influenza Center (CNIC) is used for the surveillance of influenza and other respiratory viruses. The National Health Commission of China reports infectious disease data monthly, one month after the reporting period, so its influenza data lag by one month, while the hospital-based surveillance system lags by one or two weeks. Thus, it is necessary to establish a real-time influenza forecasting system to rapidly forecast trends in influenza and other respiratory diseases.
With the widespread use of the internet, people often turn to it for help when they face health problems. When individuals search for information about health problems, including disease names, symptoms, therapies and prevention strategies, their queries can be harnessed to monitor disease trends. In 2009, Ginsberg et al. first used Google query data to establish an influenza trend model that predicted ILI rates in the U.S. in real time by monitoring millions of queries on the search engine; this approach overcomes the lag time inherent to many traditional influenza surveillance systems [4]. Initially, the forecast system provided accurate predictions and was expanded to other countries and regions. However, the forecast model was not stable: in 2013, the forecast trend exceeded the peak of the epidemic in the U.S. by more than 140%, which sparked intense discussion about the limitations of search data in infectious disease research [5]. Since then, some research teams have attempted to assess the value of online search engines and social media platforms, including Google, Yahoo, Weibo, Baidu and Twitter, for predicting disease epidemic trends with different models and have obtained useful results. The field of digital epidemiology is still at an early stage, but it has begun to be used to forecast infectious disease epidemic trends, especially during the COVID-19 pandemic [6]. Thus, in the current article, we establish a forecasting model with Baidu search index data and ILI data to forecast trends in the incidence of seasonal influenza or other respiratory viruses in China.
The long short-term memory (LSTM) neural network is a recurrent neural network (RNN) architecture that has been widely applied in text classification, time series classification and time series forecasting [7,8]. The multilayer network is trained with the backpropagation algorithm, which continually shrinks the fitting error so that the network output approaches the real values [9]. LSTM models have been widely used to predict infectious diseases such as influenza and dengue and have achieved good accuracy [10][11][12][13][14]. Previous research has confirmed that for the analysis and prediction of time series data with complex relationships, LSTM, a deep learning model, yields better results than traditional machine learning methods [10,15,16]. In this paper, we incorporated the ILI% and the online search indices of different keywords into the LSTM model to forecast trends in influenza or other respiratory viruses and to validate whether online search index data can improve forecasting accuracy.

Data collection and processing
The weekly ILI data for northern and southern mainland China were obtained separately from the Influenza Weekly Report published by the Chinese Influenza Center (CNIC) (http://ivdc.chinacdc.cn/cnic/) for January 2018 to December 2021. ILI patients were defined as outpatients of any age with acute respiratory infection syndrome with fever ≥ 38 °C and cough or sore throat. Sentinel hospitals for influenza surveillance, distributed throughout all provinces of mainland China, upload ILI case counts and total physician visit data to the Chinese influenza surveillance system on the Monday of the following week. The CNIC reports the aggregated data at the end of the next week, including the ILI%, which is the number of patients with ILI divided by the total number of physician visits. Because the influenza epidemic situation differs greatly among regions, the CNIC releases the ILI% for northern and southern mainland China separately. The provinces in northern mainland China include Beijing, Tianjin, Hebei, Shanxi, Shaanxi, Inner Mongolia, Liaoning, Jilin, Heilongjiang, Shandong, Henan, Tibet, Gansu, Qinghai, Ningxia, and Xinjiang, and the provinces in southern mainland China include Shanghai, Jiangsu, Zhejiang, Anhui, Fujian, Jiangxi, Hubei, Hunan, Guangdong, Guangxi, Hainan, Chongqing, Sichuan, Guizhou, and Yunnan.
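The ILI% definition above can be written as a one-line calculation. The following is a minimal sketch; the function name and the sentinel counts are hypothetical and are not taken from the CNIC reports.

```python
def ili_percent(ili_cases: int, total_visits: int) -> float:
    """Return ILI% = 100 * (ILI cases) / (total physician visits)."""
    if total_visits <= 0:
        raise ValueError("total_visits must be positive")
    return 100.0 * ili_cases / total_visits


# Hypothetical aggregated sentinel counts for one week (illustrative only).
weekly_ili_pct = ili_percent(ili_cases=279, total_visits=10_000)
print(f"ILI% = {weekly_ili_pct:.2f}")
```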

Online Baidu index
Search data for 31 provinces were obtained separately from the Baidu Index (http://index.baidu.com), an open online data service platform that provides the daily search index of every keyword for every province. After filtering, and based on previous studies [4,16,17], we retained 30 keywords for every province, comprising influenza-related symptom keywords, the keywords 'flu' or 'influenza', and keywords for influenza prevention strategies. First, we collected the daily search indices of the 30 selected influenza-related keywords for every province from 2018 to 2021 from the Baidu Index. Then, we obtained the weekly Baidu search index by summing the daily indices. Finally, we aggregated the weekly indices of each keyword over the provinces of northern and southern mainland China to obtain the total Baidu index of each keyword for the two regions.
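The aggregation steps above (daily to weekly, then province to region) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the row layout, the `region_of` mapping and the sample values are hypothetical, and ISO calendar weeks are assumed to match the reporting weeks.

```python
from collections import defaultdict
from datetime import date


def weekly_regional_index(daily_rows, region_of):
    """Sum daily search indices into (region, year, week, keyword) totals.

    daily_rows: iterable of (day, province, keyword, daily_index) tuples.
    region_of:  dict mapping province name -> "north" or "south".
    """
    totals = defaultdict(int)
    for day, province, keyword, value in daily_rows:
        # Assumption: ISO weeks stand in for the surveillance weeks.
        iso_year, iso_week, _ = day.isocalendar()
        totals[(region_of[province], iso_year, iso_week, keyword)] += value
    return dict(totals)


# Hypothetical sample rows (illustrative values only).
region_of = {"Beijing": "north", "Hebei": "north", "Guangdong": "south"}
rows = [
    (date(2018, 1, 1), "Beijing", "influenza", 100),
    (date(2018, 1, 2), "Beijing", "influenza", 120),
    (date(2018, 1, 1), "Hebei", "influenza", 80),
    (date(2018, 1, 1), "Guangdong", "influenza", 200),
    (date(2018, 1, 8), "Beijing", "influenza", 90),
]
weekly = weekly_regional_index(rows, region_of)
```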

Descriptive analysis
A descriptive analysis was used to reveal the characteristics of the current ILI% and the current search indices of different keywords on the Baidu search platform. The Pearson correlation coefficient was calculated to explore the association between the influenza-related search indices and the ILI% in northern and southern mainland China. We also analysed the correlation between the search index of previous weeks (lags of 1 to 4 weeks) and the current ILI%. A correlation coefficient closer to 1 or -1 indicates a stronger correlation, and a correlation coefficient closer to 0 indicates a weaker correlation. We calculated the Pearson correlation coefficient between each pair of variables to observe the correlations between them. After performing the correlation analysis, and based on previous studies [18,19], we used the variables with correlation coefficients above 0.4 to develop forecasting models in order to improve the prediction accuracy. This study statistically analysed the usefulness of these potential predictors in forecasting the ILI% and quantified their relationships during the influenza or respiratory illness seasons.
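The lagged correlation screening described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the keyword "sunscreen" and all series values are hypothetical, and the 0.4 threshold is applied to the signed coefficient, which is one reading of "above 0.4".

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def lagged_pearson(search_index, ili, lag):
    """Correlate the search index from `lag` weeks earlier with the current ILI%."""
    if lag == 0:
        return pearson(search_index, ili)
    return pearson(search_index[:-lag], ili[lag:])


def select_keywords(indices, ili, lags=range(0, 5), threshold=0.4):
    """Keep keywords whose best lagged correlation exceeds the threshold."""
    selected = {}
    for keyword, series in indices.items():
        best_lag, best_r = max(
            ((lag, lagged_pearson(series, ili, lag)) for lag in lags),
            key=lambda pair: pair[1],
        )
        if best_r > threshold:
            selected[keyword] = (best_lag, best_r)
    return selected


# Hypothetical weekly series (illustrative values only).
ili = [2.0, 2.5, 3.5, 5.0, 4.0, 3.0, 2.5, 2.0]
indices = {
    "influenza": [250, 350, 500, 400, 300, 250, 200, 180],  # leads ILI% by one week
    "sunscreen": [10, 20, 30, 40, 50, 60, 70, 80],          # steady upward trend
}
selected = select_keywords(indices, ili)
```

In this toy data the "influenza" series is selected at a lag of one week, while the steadily rising series is negatively correlated with the ILI% at every lag and is dropped.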

Module formulation
LSTM was used to predict the ILI% with the correlated search queries in China. LSTM is a special recurrent neural network (RNN) that is used to process sequence data. Compared with ordinary neural networks, RNNs perform well in processing sequential changes in data, but vanishing and exploding gradients are difficult to avoid. To solve this problem, the LSTM network was proposed for long sequence data, on which it performs better than a plain RNN.
To protect and control information, each memory block contains an input gate, a forget gate, an output gate and a memory cell.
The forget gate is controlled by a sigmoid function and determines which information obtained from the previous moment is retained at the current moment. The formula for the forget gate is F1, where W f is the weight matrix of the forget gate, x t is the current input, h t−1 is the previous output of the memory block, b f is the bias term of the forget gate, and σ is the sigmoid function.
The input gate decides how much information from the input x t is retained. The formula for the input gate is F2. Here, W i is the weight matrix of the input gate, b i is the bias term of the input gate, and the other parameters are the same as those of F1.
The output gate determines how strongly the output depends on the input x t and the current memory cell. The formula for the output gate is F3. Here, W o is the weight matrix of the output gate, b o is the bias term of the output gate, and the other parameters are the same as those in F1.
At each time step t, there is a memory cell, and the cell state is what allows the LSTM to retain memory selectively. The formulas for determining the cell state are F4, F5 and F6, where W c is the weight matrix of the current cell state, b c is the bias term of the current cell state, and tanh is an activation function.
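The formulas F1 to F6 referred to above are not reproduced in this text. For orientation, the standard LSTM formulation that matches the symbols defined above is given below. The assignment of F4, F5 and F6 to the candidate state, the cell state update and the block output is an assumption, as are the symbols C t (cell state), C̃ t (candidate state) and ⊙ (element-wise product), which do not appear in the text.

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{F1} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{F2} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{F3} \\
\tilde{C}_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{F4} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{F5} \\
h_t &= o_t \odot \tanh\left(C_t\right) \tag{F6}
\end{align}
```

Here [h t−1, x t] denotes the concatenation of the previous output and the current input, so each gate takes values between 0 and 1 and scales how much of the corresponding signal passes through.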
In this study, the absolute values of the Baidu index data and the ILI% were not on the same order of magnitude, so we normalized all the data to between 0 and 1 for further analysis and training. We defined the LSTM model with three layers of 512 neurons each. To reduce overfitting, we set a bias regularizer (1) with L2 regularization (0.005). The model was trained for 150 epochs with a batch size of 64 and a learning rate of 0.0001. In total, 126 sets of data (from week 201801 to week 202022) were used as the training set, 42 sets of data (from week 202023 to week 202111) were used as the validation set, and 41 sets of data (from week 202112 to week 202152) were used as the test set for model prediction. The predicted data were compared with the actual data to assess the model's fit. Moreover, to reduce overfitting, the dataset was augmented by averaging the values of each pair of adjacent columns in turn and inserting the average between the two columns. Three rounds of this amplification expanded the original dataset to eight times its original size. For northern and southern mainland China separately, both the ILI% and the Baidu search indices were input into the LSTM module for training, validation and forecasting. Four metrics were used to measure the performance of the LSTM model, namely, the R², mean square error (MSE), root mean square error (RMSE) and mean absolute error (MAE), which measure the accuracy of a forecasting method. An R² close to 1 and an MAE and MSE close to 0 indicate good predictive performance. The RMSE is sensitive to very large or very small errors in a set of measurements and therefore reflects the accuracy of the prediction well.
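The normalization and augmentation steps described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the augmentation is read here as inserting the mean of each pair of adjacent time points along one series.

```python
def min_max_normalize(values):
    """Rescale a series linearly so that its values lie between 0 and 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return [(v - lo) / (hi - lo) for v in values]


def insert_midpoints(series):
    """One augmentation round: insert the mean of each adjacent pair between them."""
    out = []
    for a, b in zip(series, series[1:]):
        out.extend([a, (a + b) / 2])
    out.append(series[-1])
    return out


def augment(series, rounds=3):
    """Apply the midpoint insertion repeatedly (three rounds in the text)."""
    for _ in range(rounds):
        series = insert_midpoints(series)
    return series


# Hypothetical short series (illustrative values only).
normalized = min_max_normalize([2.0, 4.0, 3.0])
one_round = insert_midpoints([2.0, 4.0, 3.0])

# A series of n points grows to 2n - 1 per round, i.e. 8n - 7 after three rounds.
augmented_length = len(augment(list(range(126))))
```

Under this reading, the 126 training weeks become 1001 points after three rounds, which is close to the eightfold expansion reported in the text.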

ILI% trend in China
The ILI% was reported every week of each year, and it showed regular seasonal peaks in northern and southern mainland China from 2018 to 2021. The average weekly ILI% was 2.79% and 3.62% in northern and southern mainland China, respectively. For northern mainland China, the highest ILI% was observed in the 1st week of 2018 (5.8%), the 6th week of 2019 (6.2%), the 5th week of 2020 (8.5%) and the 52nd week of 2021 (4.1%). For southern mainland China, the highest ILI% was observed in the 3rd week of 2018 (6.7%), the 6th week of 2019 (6.8%), the 5th week of 2020 (8.0%) and the 23rd week of 2021 (4.4%). From January 2018 to March 2020, the highest ILI% was observed in the winter season in mainland China, and the period of high ILI% lasted longer in the southern region than in the northern region; however, from April 2020 to March 2021, the ILI% was lower than the average ILI%. Beginning in April 2021, the ILI% gradually returned to its original level, and two small peaks occurred in June and December 2021 (Fig. 1).

Baidu search queries
We retrieved information on 30 search terms, including different influenza names, influenza symptoms, influenza drugs and masks. For all 30 terms, the weekly Baidu search index of northern and southern mainland China was calculated from the daily search index of every province. The weekly average values of the Baidu search indices of the different keywords for northern and southern mainland China are compared in Table 1. Pearson correlation analysis was also conducted between the Baidu search index and the ILI% across different lag periods, namely the current week and lags of one, two, three and four weeks. The correlation coefficients between the ILI% and the different Baidu search queries varied widely in northern and southern mainland China. For northern mainland China, 24 Baidu search terms were correlated with the ILI%, in the current week or at the lag weeks, with a correlation coefficient above 0.4 (Table 1).

Evaluation scheme
First, the original ILI% alone was input into the LSTM module as the gold standard for predicting the ILI% activity trend. Second, the ILI% was input into the LSTM module together with the Baidu search indices that had a correlation coefficient above 0.4. These inputs were divided into five categories with different lag times (ILI% + all of the Baidu indices, ILI% + the index of influenza name, ILI% + the index of influenza therapy and drug, ILI% + the index of influenza symptoms, and ILI% + the index of mask) to compare the effects of the different combinations using the calculated MSE, RMSE, MAE and R². For northern mainland China, the R² of the ILI% + the index of mask with one lag week module reached 0.9055, which was higher than the corresponding values of the ILI% alone and of the other combinations of ILI% + the Baidu search index. Similarly, the MAE was 0.14325 and the MSE was 0.02762, which were lower than the corresponding values of the ILI% alone and of the other combinations of ILI% + the Baidu search index (Table 2). For southern mainland China, the R² of the ILI% + the index of influenza name module reached 0.75579, which was higher than the corresponding values of the ILI% alone and of the other combinations of ILI% + the Baidu search index. Similarly, the MAE was 0.17832 and the MSE was 0.05211, which were lower than the corresponding values of the ILI% alone and of the other combinations of ILI% + the Baidu search index (Table 2). These results showed that ILI% + the index of mask with one lag week and ILI% + the index of influenza name had the best prediction effects for northern and southern mainland China, respectively. The LSTM module reduced the RMSE by 16.75% and 4.20% compared with the estimates based on the ILI% alone for northern and southern mainland China, respectively. We then plotted the predictions, and the results showed that the predicted values were consistent with the actual values and that the accuracy was high (Figs. 2 and 3).
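The four evaluation metrics and the relative RMSE reduction can be computed as in the sketch below. The function names and the sample values are hypothetical, and the reduction is assumed to be measured relative to the RMSE of the model that uses the ILI% alone; the text does not spell out the exact formula.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R2 for paired actual and predicted values."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the actual values are constant")
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


def rmse_reduction_percent(rmse_baseline, rmse_model):
    """Relative RMSE reduction of a model against a baseline, in percent."""
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline


# Hypothetical values (illustrative only, not taken from Table 2).
metrics = regression_metrics([3.0, 4.0, 5.0, 4.0], [3.5, 4.0, 4.5, 4.0])
reduction = rmse_reduction_percent(rmse_baseline=0.20, rmse_model=0.15)
```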

Discussion and conclusion
The dynamic assessment and forecasting of epidemic trends are important parts of the prevention and control of infectious diseases. The ILI% is a good indicator for detecting respiratory illness and influenza viruses.
The ILI% has been used to predict trends in influenza virus or respiratory illness incidence. To predict the trend more precisely, several researchers have combined the ILI% and search indices using different methods, such as the seasonal autoregressive integrated moving average (SARIMA) model and linear regression models [20][21][22][23]. However, the results have shown that the prediction accuracy is not high [16,24]. With the development of artificial intelligence, machine learning algorithms have shown advantages in prediction and recognition. LSTM is an advanced RNN with the ability to learn temporal patterns and to retain useful information for longer. LSTM has been widely used to analyse and predict time series data in various sectors and has been confirmed to outperform some statistics-based algorithms [10,25]. At present, there are few reports on the prediction of influenza infection with an RNN combined with the Baidu Index [16]. In this study, we examined the Baidu search indices related to the ILI% and proposed an LSTM model to predict the occurrence of respiratory disease or influenza in northern and southern mainland China; the results confirmed that the Baidu search intensity of keywords is a useful disease surveillance tool and further showed that the ILI% + Baidu search index performed better as a predictor than the ILI% alone. Previous research on disease prediction has shown that data from internet sources, including Google, Twitter and other media containing important information, can be used to effectively predict disease incidence, and that there is a strong correlation between disease-related searches and disease cases [26][27][28][29][30]. Moreover, a user's search behaviour can reflect how concerned the user is about a certain event. Information search behaviour is a targeted information acquisition behaviour carried out by users to meet their specific needs [31]. When people around them suffer from influenza or have influenza-like symptoms, many people tend to search the internet for influenza prevention measures, flu symptoms and other related information.
In this study, our results revealed that 7 Baidu search queries were strongly correlated with the ILI% in both northern and southern mainland China; these queries were H1N1 pdm2009, influenza, Tamiflu, oseltamivir, oseltamivir granules, symptoms of influenza A, and symptoms of influenza. In northern mainland China, however, a further 17 items were also strongly correlated with the ILI%. To explain this difference, some possible intrinsic limitations in the application of search data to epidemic disease surveillance should be considered. For instance, web users' educational level, regional background, cognition level and the local disease epidemic trend can influence users' search habits and keywords. For influenza, there is only one epidemic peak annually in northern mainland China, while there are two epidemic peaks in southern mainland China. Differences in influenza epidemic trends may influence cognition levels and thus behavioural habits. In northern mainland China, most people suffer from influenza or respiratory illness in winter; thus, the search index, including searches for some symptoms and for the use of masks, increases rapidly in winter. In southern mainland China, however, most people suffer from influenza or respiratory illness twice a year [32]; they may therefore be more familiar with epidemic trends and focus less on symptoms or prevention measures. Consequently, we cannot integrate all the search indices into the model to forecast disease epidemic trends because not all Baidu indices are strongly correlated with the ILI%. According to the LSTM model, the ILI% + mask at a lag of 1 week was a good predictor of the ILI% trend in northern mainland China; for southern mainland China, however, the ILI% + influenza name was the best predictor, which has not been discussed in other studies. Therefore, for disease prediction, using highly correlated data and grouping the data by category can improve accuracy, and some categories strengthen the prediction accuracy further. Our study revealed that geographical location may affect the prediction of disease epidemics.
The algorithms and computational techniques used for computation and analysis still need to be carefully refined, tuned and calibrated to avoid the risk of overfitting in big data. First, to avoid overfitting in the LSTM model, the data in our study (including one ILI sequence and twenty-four or seven Baidu Index sequences) were amplified three times, after which they were found to be sufficient to improve the robustness of training at the data level. Second, the LSTM model is a lightweight and appropriate model for solving our target problem. Several options are available for controlling overfitting in this model, including adjusting the number of LSTM layers, adjusting the number of LSTM units, applying dropout, using regularization, and using additional training data. In our paper, to reduce overfitting, we configured the LSTM model with three layers of 512 units each and used a bias regularizer to train the model.
The prediction of epidemic trends in influenza or respiratory illness is a topic of intense discussion worldwide. Adding the Baidu search index of influenza-related keywords to influenza forecasting can effectively improve the accuracy of influenza forecasting in China. Furthermore, the search indices of different keywords differ in how much they influence the accuracy of the prediction results. The next step of this research will be to incorporate relevant meteorological factors into the model, with the aim of constructing more accurate prediction models of influenza and respiratory diseases from multidimensional factors.

Fig. 1 ILI% of northern and southern mainland China. Note: 1801 represents the first week in 2018

Fig. 2 Actual and predicted ILI% of northern mainland China in 2021. A Training and validation of the LSTM model with ILI%. B Actual and predicted ILI% with ILI%. C Training and validation of the LSTM model with ILI% + mask with one lag week. D Actual and predicted ILI% with ILI% + influenza + mask with one lag week. Note: 202112 represents the twelfth week in 2021

Table 1
Pearson correlation between Baidu search terms and ILI% in northern and southern mainland China from 2018 to 2021

Table 2
Evaluation metrics of the LSTM model for the ILI% and different Baidu search indices